Model description
The model underlying the “gjam” package is described in the article “Generalized joint attribute modeling for biodiversity analysis: Median-zero, multivariate, multifarious data” (Clark et al. (2016)).
GJAM model accommodates the median-zero, multivariate, multifarious data and allows joint modeling of continuous and discrete observations (multifarious observations), avoiding non-linear transformation (link) functions. Combination of the different types of observations is implemented with the use of censoring and effort, where censoring allows to transfer from continuous to discrete cases and effort allows combining discrete and continuous observations with appropriate weight.
Works with data inputs which could be: presence-absence, ordinal, continuous, discrete, zero-inflated or censored. Joint species distribution provides inference on sensitivity to input variables, the correlation between species on data scale, prediction, the definition of community structure and missing data imputation. Inverse prediction - prediction of environmental conditions from species.
The model address the following challenges
- Median-zero: most of the values in data sets are 0
- Multivariate: species are not independent
- Multifarious: different data types. (combine responses on different scales, eg species count + plot condition)
Data types
- CA continous abundance
- DA discrete abundance
- PA presence-absence
- OC ordinal count
- CC composition-count
- FC fractional composition
- CAT categorical data
General model
Observations: \((x_i,y_i)\) - sample , \(i=1 \dots n\), \(x_i\) vector of predictors \(q=1 \dots Q\), \(y_i\) vector of responses, \(y_{is}\), \(s=1 \dots S\), could consist of different types of observations. Observed values of Y are represetned by continous W values and discrete Z values.
\(w_i \in R^S\) locates \(y_i\) in continous space.\(w_{is}\) is known if the response \(y_{is}\) is continous, or unknown if the \(y_{is}\) is discrete.
\(z_i \in \{0, \dots ,K-1\}^S\) locates \(y_i\) in discrete space.Each observed \(y_{is}\) is placed in the interval \(z_{is}\). Number of intervals could differ for species and observations.
Each ecological attribute is accomodated by different combinations of known and unknown \(\{W,Z,P\}\), with C subset of censoring intervals. W, Z latent variables and partition P.
Partition P: the partition of continous space at points \(p_{is,k} \in P\) on \(\mathbb{R}\) defines discrete intervals \(z_{is}\), \((p_{is,k},p_{is,k+1}]\) boudns \(k^{th}\) interval of s in observation i. For discrete observations, k is censored and \(w_{is}\) is latent variable.
\[ y_{is}=\begin{cases} w_{is} \text{ }\text{ continous}\\ z_{is} , w_{is} \in (p_{z_{is}},p_{z_{is}+1}] \text{ discrete}\\ \end{cases} \]
Effort \(E_{is}\) affects the partition for discrete data. Partition for interval k depends on \(E_{is}\)
When effort varies between observations: \[(p_{i,k}, p_{i,k+1}] = (\frac{k-1/2}{E_i},\frac{k+1/2}{E_i}]\]
\[w_i \mid x_i,y_i,E_i \sim MVN(\mu_i, \Sigma) \times \prod_{s=1}^{S}I_{is}\] \[\mu_i= B'x_i\] \[I_{is} = \prod_{k \in C} I_{is,k}^{I(y_{is}=k)} (1- I_{is})^{I(y_{is} \neq k)}\] where \(I_{is}= I(p_{z_{is}} < w_{is}< p_{z_{is}}] \text{ C set of discrete intervals}\), B matrix of coefficients, \(\Sigma\) covariance matrix.
Algorithm:
Input X is centred and standartized. B and \(\Sigma\) are sampled
1. \(\Sigma \mid W,B\)
2. \(B \mid \Sigma, W\)
3. For unknown partition, the partition is sampled \(P \mid Z,W\) 4. For ordinal, presence-absence and categorical data, latent variables are drawn on correlation scale \(W \mid R,\alpha, P\), where \(R= D^{-\frac{1}{2}}\Sigma D^{\frac{1}{2}}\), \(\alpha = D^{-\frac{1}{2}}B\), \(P= D^{-\frac{1}{2}}P\) , \(D=diag(\Sigma)\). For other that are discrete or censored, latent variables are sampled on the covariance scale \(W \mid \Sigma, B,P\)
Model with dimension reduction
Model is defined in the article D.Taylor-Rodriguez, K. Kaufeld, E.M. Schilep,J.S. Clark, A.E. Gelfand “Joint Species Distribution Modeling: Dimension Reduction Using Dirichlet Processes”(2017) \[V_i= Bx_i + \epsilon_i, \text{ with }\epsilon_i \sim N_S(0_s,\Sigma)\] Approximate \(\Sigma\) with \(\Sigma'\) \[\Sigma' = AA^T + \sigma^2I, \text{ where A is } S\times r \text{ matrix}\]
\[V_i= Bx_i + Aw_i+\epsilon_i , \text{with } \epsilon_i \sim N_S(0_s,\sigma^2I), w_i \sim N_r(0_r,I_r)\]
Gjam for simluated data
Using “gjamSimData” function.
library(gjam)
library(ggplot2)
library(gridExtra)
library(knitr)
## Example:
S <-5
f <- gjamSimData(n = 200, S = S, Q = 4, typeNames = 'PA')
ml <- list(ng = 50, burnin = 5, typeNames = f$typeNames, holdoutN = 10)
out <- gjam(f$formula, f$xdata, f$ydata, modelList = ml)
Observations and responses:
[1] 200 5
======================================================================================================================================================
##Predicted 5 Species and
# predict data
#par(mfrow=c(1,3),bty='n')
#gjamPredict(out, y2plot = colnames(f$ydata)) #predict the data in-sample
#title('full sample')
# out-of-sample prediction
xdata <- f$xdata[1:20,]
xdata[,3] <- mean(f$xdata[,3]) # mean for x[,3]
xdata[,4] <- mean(f$xdata[,4])# mean for x[,4]
xdata[,2] <- seq(-2.5,2.5,length=20) # gradient x[,2]
newdata <- list(xdata = xdata, nsim = 50 )
p1 <- gjamPredict(out, newdata = newdata)===========================================================================
# plus/minus 1 prediction SE, default effort = 1000
x2 <- p1$x[,2]
ylim <- c(0, max(p1$sdList$yMu[,1] + p1$sdList$yPe[,1]))
par(mfrow=c(2,3))
plot(x2, p1$sdList$yMu[,1],type='l',lwd=2, ylim=c(0, max(p1$sdList$yMu[,1] + p1$sdList$yPe[,1])), xlab='x2',
ylab = 'Predicted')
mtext(paste(expression(beta),"=",round(out$parameters$betaMu[2,1],2)), side = 3)
lines(x2, p1$sdList$yMu[,1] + p1$sdList$yPe[,1], lty=2)
lines(x2, p1$sdList$yMu[,1] - p1$sdList$yPe[,1], lty=2)
# .95 prediction error
title('SE and prediction, Sp 1')
plot(x2, p1$sdList$yMu[,2],type='l',lwd=2, ylim=c(0, max(p1$sdList$yMu[,2] + p1$sdList$yPe[,2])), xlab='x2',
ylab = 'Predicted')
mtext(paste(expression(beta),"=",round(out$parameters$betaMu[2,2],2)), side = 3)
lines(x2, p1$sdList$yMu[,2] + p1$sdList$yPe[,2], lty=2)
lines(x2, p1$sdList$yMu[,2] - p1$sdList$yPe[,2], lty=2)
title('SE and prediction, Sp 2')
plot(x2, p1$sdList$yMu[,3],type='l',lwd=2, ylim=c(0, max(p1$sdList$yMu[,3] + p1$sdList$yPe[,3])), xlab='x2',
ylab = 'Predicted')
mtext(paste(expression(beta),"=",round(out$parameters$betaMu[2,3],3)), side = 3)
lines(x2, p1$sdList$yMu[,3] + p1$sdList$yPe[,3], lty=2)
lines(x2, p1$sdList$yMu[,3] - p1$sdList$yPe[,3], lty=2)
title('SE and prediction, Sp 3')
plot(x2, p1$sdList$yMu[,4],type='l',lwd=2, ylim=c(0, max(p1$sdList$yMu[,4] + p1$sdList$yPe[,4])), xlab='x2',
ylab = 'Predicted')
mtext(paste(expression(beta),"=",round(out$parameters$betaMu[2,4],2)), side = 3)
lines(x2, p1$sdList$yMu[,4] + p1$sdList$yPe[,4], lty=2)
lines(x2, p1$sdList$yMu[,4] - p1$sdList$yPe[,4], lty=2)
title('SE and prediction, Sp 4')
plot(x2, p1$sdList$yMu[,5],type='l',lwd=2, ylim=c(0, max(p1$sdList$yMu[,5] + p1$sdList$yPe[,5])), xlab='x2',
ylab = 'Predicted')
mtext(paste(expression(beta),"=",round(out$parameters$betaMu[2,5],5)), side = 3)
lines(x2, p1$sdList$yMu[,5] + p1$sdList$yPe[,5], lty=2)
lines(x2, p1$sdList$yMu[,5] - p1$sdList$yPe[,5], lty=2)
#lines(x2, p1$sdList$yMu[,1] + p1$sdList$yPe[,1], lty=2)
#lines(x2, p1$sdList$yMu[,1] - p1$sdList$yPe[,1], lty=2)
# .95 prediction error
#lines(x2, p1$piList$yLo[,1], lty=3)
#lines(x2, p1$piList$yHi[,1], lty=3)
title('SE and prediction, Sp 5')
library(qgraph)
cormat= out$parameters$corMu #correlation matrix generated
qgraph(cormat, shape="circle", posCol="darkred", negCol="darkblue", layout="groups", vsize=10)kable(out$parameters$sigMu)| S1 | S2 | S3 | S4 | S5 | |
|---|---|---|---|---|---|
| S1 | 0.9508193 | 0.1902954 | 0.0663631 | 0.1501675 | 0.1101471 |
| S2 | 0.1902954 | 0.9590738 | 0.0103810 | -0.3015545 | 0.3241370 |
| S3 | 0.0663631 | 0.0103810 | 0.9821159 | 0.0051538 | -0.0675015 |
| S4 | 0.1501675 | -0.3015545 | 0.0051538 | 0.9608596 | 0.0169967 |
| S5 | 0.1101471 | 0.3241370 | -0.0675015 | 0.0169967 | 0.9286560 |
cormat= out$parameters$ematrix #correlation matrix generated
qgraph(cormat, shape="circle", posCol="darkred", negCol="darkblue", layout="groups", vsize=10)########## Partial Dependence plot
n.sample<- 200
#xseq <- seq(-2,2,length=20) # gradient x[,2]
presence_array<- matrix(nrow=n.sample, ncol =5)
for (i in 1:n.sample) {
xtrain <- f$xdata[1:200,]
#xtrain[,2] <- xseq[i] # value for ith predictor duplicate for all set
xtrain[,2] <-xtrain[i,2]
newdata <- list(xdata = xtrain, nsim = 50)
p_tr <- gjamPredict(out, newdata = newdata)
presence_array[i,]<- colSums(p_tr$sdList$yMu)/n.sample
}par(mfrow=c(2,3))
plot(f$xdata[1:200,2],presence_array[,1],xlab=expression(x[2]),main="",ylab="Prediction of occurence S1",lwd=2)
rug(f$xdata[1:200,2], col="red")
plot(f$xdata[1:200,2],presence_array[,2],xlab=expression(x[2]),main="",ylab="Prediction of occurence S2",lwd=2)
rug(f$xdata[1:200,2], col="red")
plot(f$xdata[1:200,2],presence_array[,3],xlab=expression(x[2]),main="",ylab="Prediction of occurence S3",lwd=2)
rug(f$xdata[1:200,2], col="red")
plot(f$xdata[1:200,2],presence_array[,4],xlab=expression(x[2]),main="",ylab="Prediction of occurence S4",lwd=2)
rug(f$xdata[1:200,2], col="red")
plot(f$xdata[1:200,2],presence_array[,5],xlab=expression(x[2]),main="",ylab="Prediction of occurence S5",lwd=2)
rug(f$xdata[1:200,2], col="red")n.sample<- 20
#xseq <- seq(-2,2,length=20) # gradient x[,2]
presence_array<- matrix(nrow=n.sample, ncol =5)
pr_array<-array(0,dim=c(n.sample,n.sample,5))
for (i in 1:n.sample) {
for (j in 1:n.sample){
xtrain <- f$xdata[1:20,]
xtrain[,2] <-xtrain[i,2]
xtrain[,3] <-xtrain[j,3]
newdata <- list(xdata = xtrain, nsim = 50)
p_tr <- gjamPredict(out, newdata = newdata)
pr_array[i,j,]<- colSums(p_tr$prPresent)/n.sample
}
}
pred_function <- function(x,y,s=3) {
xtrain <- f$xdata[1:20,]
xtrain[,2] <-x
xtrain[,3] <-y
newdata <- list(xdata = xtrain, nsim = 50)
p_tr <- gjamPredict(out, newdata = newdata)
pr_array<- colSums(p_tr$prPresent)/n.sample
return(pr_array[s])
}
x<- sort(f$xdata[1:20,2])
y<- sort(f$xdata[1:20,3])
z<- outer(x,y,Vectorize(pred_function))p <- plot_ly(x=x, y=y, z=z) %>% add_surface() %>%
layout(scene= list(xaxis= list(title="x_2"),
yaxis= list(title="x_3"),
zaxis= list(title="Prediction")))
p#Possible plots
f <- gjamSimData(n = 500, S = 10, typeNames = 'PA')
ml <- list(ng = 1000, burnin = 200, typeNames = f$typeNames)
out <- gjam(f$formula, f$xdata, f$ydata, modelList = ml)
pl <- list(trueValues = f$trueValues, GRIDPLOTS = T)
gjamPlot(output = out, plotPars = pl)Species occurence in environmental space
df<-as.data.frame(cbind(f$xdata[,3],f$xdata[,2],f$y[,1]))
#plot(df,color=df$V3)
library(ggplot2)
#attach(df); plot(V1, V2, col=c("red","blue")[V3]); detach(df)
qplot(V1, V2, colour = V3,
data = df, main="species occurence in environmental space (V1,V2))")Plots for the latent normal variables
mean=out$parameters$wMu[1,1]; sd=out$parameters$wSd[1]
#lb=80; ub=120
mean2=out$parameters$wMu[1,2]; sd2=out$parameters$wSd[2]
mean3=out$parameters$wMu[1,3]; sd3=out$parameters$wSd[3]
mean4=out$parameters$wMu[1,4]; sd4=out$parameters$wSd[4]
mean5=out$parameters$wMu[1,5]; sd5=out$parameters$wSd[5]
x <- seq(-8,8,length=200)*sd + mean
hx <- dnorm(x,mean,sd)
x2 <- seq(-8,8,length=200)*sd + mean
hx2 <- dnorm(x2,mean2,sd2)
x3 <- seq(-8,8,length=200)*sd + mean
hx3 <- dnorm(x3,mean3,sd3)
x4 <- seq(-8,8,length=200)*sd + mean
hx4 <- dnorm(x4,mean4,sd4)
x5<- seq(-8,8,length=200)*sd + mean
hx5 <- dnorm(x5,mean5,sd5)
plot(x2, hx2,type="n",main="")
#points(x2,hx2,,main="Normal Distribution")
i <- x >= 0
#lines(x, hx)
lines(x2, hx2)
#lines(x3, hx3)
#lines(x4, hx4)
lines(x5, hx5)
polygon(c(0,x2[i],x[200]), c(0,hx2[i],0), col="grey")
polygon(c(0,x5[i],x[200]), c(0,hx5[i],0), col="grey")
area <- pnorm(x5[200], mean5, sd5) - pnorm(0, mean5, sd5)
result <- paste("P(Wi5 >",0,") =",
signif(area, digits=3),"|")
mtext(result,3)
axis(1, at=seq(40, 160, 20), pos=0)mu <- c(mean,mean2) # Mean
Sigma <- matrix(c(out$parameters$sigMu[1,1],out$parameters$sigMu[1,2], out$parameters$sigMu[2,1], out$parameters$sigMu[2,2]),2) # Covariance matrix
library(MASS)
# Generate sample from N(mu, Sigma)
bivn <- mvrnorm(5000, mu = mu, Sigma = Sigma ) # from Mass package
head(bivn)
# Calculate kernel density estimate
bivn.kde <- kde2d(bivn[,1], bivn[,2], n = 50)
# Contour plot overlayed on heat map image of results
image(bivn.kde) # from base graphics package
contour(bivn.kde, add = TRUE) # from base graphics packagelibrary(ellipse)
rho <- cor(bivn)
#plot(bivn, xlab = "X", ylab = "Y",
# col = "dark blue",
# main = "Bivariate Normal with Confidence Intervals")
plot(ellipse(rho, centre = mu), col="red", type="l") # ellipse() from ellipse package
lines(ellipse(rho, centre = mu,level = .99), col="green")
lines(ellipse(rho,centre = mu, level = .90), col="blue")
#abline(y_on_x)
#abline(x_on_y, col="brown")
abline(h=0)
abline(v=0)#legend(3,1,legend=plot_legend,cex = .5, bty = "n")
# Three dimensional surface
# Basic perspective plot
persp(bivn.kde, phi = 45, theta = 30, shade = .1, border = NA) # from base graphics package# RGL interactive plot
#library(rgl)
#col2 <- heat.colors(length(bivn.kde$z))[rank(bivn.kde$z)]
#persp3d(x=bivn.kde, col = col2)
#Higher Dimensional Distributions
library(corrplot)
library(clusterGeneration)
mu <- out$parameters$wMu[1,]
#pdMat <- genPositiveDefMat(10,lambdaLow=10)
Sigma <- out$parameters$sigMu
dim(Sigma)
mvn <- mvrnorm(5000, mu = mu, Sigma = Sigma )
corrplot(cor(mvn),
method="ellipse",
tl.pos="n",
title="Matrix Correlations")Case study: Forset data plot
library(repmis)
d <- "https://github.com/jimclarkatduke/gjam/blob/master/forestTraits.RData?raw=True"
source_data(d)
xdata <- forestTraits$xdata[,c(1,2,8)]
formula<-as.formula( ~temp + deficit +soil)
y<- gjamReZero(forestTraits$treesDeZero)
treeYdata<- gjamTrimY(y,10)$y
dim(treeYdata)
rl <- list(r = 8, N = 20)
ml <- list(ng = 2500, burnin = 500, typeNames = 'DA', reductList = rl)
form <- as.formula( ~ temp*deficit + I(temp^2) + I(deficit^2) )
out <- gjam(form, xdata = xdata, ydata = treeYdata, modelList = ml)
specNames <- colnames(treeYdata)
specColor <- rep('black',ncol(treeYdata))
specColor[ c(grep('quer',specNames),grep('cary',specNames)) ] <- 'brown'
specColor[ c(grep('acer',specNames),grep('frax',specNames)) ] <- 'darkgreen'
specColor[ c(grep('abie',specNames),grep('pice',specNames)) ] <- 'blue'
pl <- list(GRIDPLOTS=T, specColor = specColor)
gjamPlot(output = out, plotPars = pl)# Case study: aravo data set